Debug about incremental indexing and also MCP #26
Pull request overview
This PR tightens incremental indexing and MCP server behaviors to avoid mixed in-memory artifacts, improve repo naming stability for local paths, and eliminate stdout print() noise (important for MCP stdio).
Changes:
- Add stable local-path repo naming (get_repo_name_from_path()) with legacy index reuse and collision detection in MCP flows.
- Harden indexing/caching flows: rebuild vector/graph state cleanly per repo, improve multi-repo cache loading correctness, and add incremental reindex safeguards (embedding dimension checks, effective repo_root, overview refresh, manifest persistence).
- Add a comprehensive regression test suite covering incremental reindexing + MCP behaviors and verifying no print() calls in critical paths.
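The suffix format parsed by strip_repo_hash_suffix() later in this review suggests the disambiguation scheme is "basename plus an 8-character hex hash". The actual get_repo_name_from_path() in fastcode/utils.py is not shown here, but a minimal sketch of that kind of stable, collision-resistant naming could look like this (function body and hash choice are assumptions, not the PR's implementation):

```python
import hashlib
import os


def repo_name_sketch(path: str) -> str:
    """Illustrative only: derive a stable repo identifier from a local
    path by combining the directory basename with a short hash of the
    absolute path, so same-named directories in different locations
    map to distinct names."""
    abs_path = os.path.abspath(path)
    base = os.path.basename(abs_path) or "repo"
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:8]
    return f"{base}-{digest}"
```

Because the hash is derived from the absolute path, re-indexing the same directory always reuses the same name, while two checkouts of "myrepo" in different parents get distinct identifiers.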
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_incremental_indexing_regressions.py | New regression tests for incremental reindexing, MCP fallback logic, repo naming stability, and stdout safety. |
| mcp_server.py | Canonical repo naming for local paths, clean full-reindex helper, incremental fallback logic, stricter in-memory load validation, and enable incremental refresh for lookup tools. |
| fastcode/utils.py | Introduces get_repo_name_from_path() for stable, disambiguated local repo identifiers. |
| fastcode/loader.py | Uses get_repo_name_from_path() to name workspace copies consistently and support reuse of workspace copies. |
| fastcode/main.py | Ensures clean per-repo in-memory indexing, fresh graph builder usage, safer cache-load rebuilds, multi-repo cache hardening, and more robust incremental reindexing. |
| fastcode/vector_store.py | Replaces stdout print() with logger debug output in repo overview search. |
| fastcode/retriever.py | Replaces stdout print() with structured logging during repo selection. |
| fastcode/query_processor.py | Removes stdout print() of LLM response in favor of debug logging. |
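The stdout-safety checks in the new test suite are not shown here; one way such a test can verify "no print() calls in critical paths" is by walking each module's AST rather than capturing runtime output. A hedged sketch (the real tests in tests/test_incremental_indexing_regressions.py may use a different mechanism):

```python
import ast


def count_print_calls(source: str) -> int:
    """Count bare print(...) calls in a Python source string.
    Method calls like logger.debug(...) are Attribute nodes,
    not Name nodes, so they are not counted."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )
```

A test would then assert count_print_calls(path.read_text()) == 0 for each critical module, which matters for MCP stdio because any stray stdout write corrupts the protocol stream.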
```python
# failures cannot leave stale repos marked as loaded.
self.loaded_repositories = {}
self.repo_indexed = False
self.repo_loaded = False
self.multi_repo_mode = False
if hasattr(self.retriever, "current_loaded_repos"):
    self.retriever.current_loaded_repos = None
```
In _load_multi_repo_cache, the in-memory merge reinitializes self.vector_store and resets self.retriever.current_loaded_repos, but it does not clear HybridRetriever's filtered caches (filtered_vector_store / filtered_bm25*). If a prior request loaded a filtered subset, subsequent queries without an explicit repo_filter can still use the stale filtered_vector_store (because _semantic_search prefers it when non-empty), returning incomplete results after a reload. Clear these filtered caches whenever rebuilding the merged in-memory index (either by calling invalidate_in_memory_state() here or explicitly resetting the filtered fields).
Suggested change:

```python
# failures cannot leave stale repos marked as loaded or stale
# filtered caches on the retriever.
self.loaded_repositories = {}
self.repo_indexed = False
self.repo_loaded = False
self.multi_repo_mode = False
# Reset retriever state that depends on the in-memory index
if hasattr(self.retriever, "current_loaded_repos"):
    self.retriever.current_loaded_repos = None
# Clear any filtered vector store used for subset queries
if hasattr(self.retriever, "filtered_vector_store"):
    self.retriever.filtered_vector_store = None
# Clear all BM25-based filtered caches (e.g., filtered_bm25_corpus,
# filtered_bm25_index, etc.) so subsequent queries use the fresh index.
for attr_name in dir(self.retriever):
    if attr_name.startswith("filtered_bm25"):
        setattr(self.retriever, attr_name, None)
```
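The invalidate_in_memory_state() alternative mentioned in the comment would move this reset into the retriever itself, so every rebuild path clears the same set of caches. A minimal sketch (the method name comes from the comment; the class body and field names here are assumptions inferred from the diff, not HybridRetriever's real definition):

```python
class HybridRetriever:
    # Only the cache-reset surface is sketched; the real class
    # carries much more state (embeddings, BM25 index, etc.).
    def __init__(self) -> None:
        self.current_loaded_repos = None
        self.filtered_vector_store = None
        self.filtered_bm25_corpus = None
        self.filtered_bm25_index = None

    def invalidate_in_memory_state(self) -> None:
        """Drop every cache derived from the merged in-memory index
        so the next query rebuilds against fresh data instead of a
        stale filtered subset."""
        self.current_loaded_repos = None
        self.filtered_vector_store = None
        for attr_name in dir(self):
            if attr_name.startswith("filtered_bm25"):
                setattr(self, attr_name, None)
```

Centralizing the reset means a future filtered_bm25_* field added to the retriever is cleared automatically, rather than relying on each caller to remember the full list.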
```python
def strip_repo_hash_suffix(repo_name: str) -> Optional[str]:
    base_name, sep, suffix = repo_name.rpartition("-")
    if not sep or len(suffix) != 8:
        return None
    if not all(ch in "0123456789abcdef" for ch in suffix.lower()):
        return None
    return base_name or None
```
strip_repo_hash_suffix() is implemented twice in this file (once inside _repo_name_from_source and again inside _ensure_repos_ready). Duplicating the parsing rules increases the chance these helpers drift over time (e.g., suffix length/format changes) and creates subtle behavior mismatches. Consider extracting a single module-level helper and reusing it from both call sites.
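Extracted once at module level (placing it in fastcode/utils.py next to get_repo_name_from_path() is an assumption about where it would best live), both call sites share a single definition of the suffix format:

```python
from typing import Optional


def strip_repo_hash_suffix(repo_name: str) -> Optional[str]:
    """Return the base name if repo_name ends in "-<8 hex chars>",
    else None. Single source of truth for the hash-suffix format,
    reused by _repo_name_from_source and _ensure_repos_ready."""
    base_name, sep, suffix = repo_name.rpartition("-")
    if not sep or len(suffix) != 8:
        return None
    if not all(ch in "0123456789abcdef" for ch in suffix.lower()):
        return None
    return base_name or None
```

With one definition, a future change to the suffix length or alphabet is made in exactly one place, eliminating the drift risk the comment describes.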